@phsm phsm commented Feb 24, 2025

Description

This PR speeds up generation of the Prometheus exporter reply by ensuring that the same host tag is not polled more than once.

It stores the host tags in a HashSet instead of an ArrayList, which deduplicates tags that are present on multiple hosts.
It fixes two bugs:

  1. Prevents duplication of the cloudstack_vms_total_by_tag{filter="<vm state>", zone="<zonename>", tags="<tagname>"} metric in the reply.
  2. Speeds up generation of the Prometheus exporter reply on large CloudStack installations where many hosts share the same tags, for example 200 hosts with only 5 unique host tags. In my setup, the time to get the Prometheus exporter reply dropped from 4.5 minutes to 26 seconds.
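
The effect of switching collections can be illustrated with a minimal sketch. This is not the code from PrometheusExporterImpl.java: the class name, the method names, and the comma-separated tag strings are assumptions made for illustration only. It shows that a list keeps one entry per host-tag occurrence, while a set keeps one entry per unique tag, so any per-tag work runs once per unique tag.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the actual PrometheusExporterImpl code.
final class HostTagDedupSketch {

    // A list keeps every occurrence: a tag shared by N hosts is stored N times.
    public static List<String> collectTagsAsList(List<String> hostTagStrings) {
        List<String> tags = new ArrayList<>();
        for (String hostTags : hostTagStrings) {
            addTags(hostTags, tags);
        }
        return tags;
    }

    // A set stores each tag once, no matter how many hosts carry it.
    public static Set<String> collectTagsAsSet(List<String> hostTagStrings) {
        Set<String> tags = new HashSet<>();
        for (String hostTags : hostTagStrings) {
            addTags(hostTags, tags);
        }
        return tags;
    }

    // Assumption: the tags of one host arrive as a comma-separated string.
    private static void addTags(String hostTags, Collection<String> target) {
        if (hostTags == null || hostTags.isBlank()) {
            return;
        }
        for (String tag : hostTags.split(",")) {
            String trimmed = tag.trim();
            if (!trimmed.isEmpty()) {
                target.add(trimmed);
            }
        }
    }

    public static void main(String[] args) {
        List<String> hosts = List.of("ssd,gpu", "ssd,gpu", "ssd");
        System.out.println(collectTagsAsList(hosts).size()); // prints 5
        System.out.println(collectTagsAsSet(hosts).size());  // prints 2
    }
}
```

With 200 hosts sharing 5 tags, the list variant would drive up to 1000 per-tag lookups, while the set variant drives 5.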

Steps to reproduce

  1. Take a large CloudStack installation with more than 100 hosts.
  2. Assign a small set of unique tags (e.g. 5 tags) to these hosts, so that host1 gets some of those tags, host2 gets the same tags, host3 gets the same tags, and so on.
  3. Run time curl http://<cs.server.ip>:<prometheus.port>/metrics against the Prometheus exporter port. Inspect the reply and note how long it took to finish.
  4. Apply this patch.
  5. Run the same command from step 3 and compare the time taken to process the request.

Expected behavior:

  • Metrics with the same name and labels, such as cloudstack_vms_total_by_tag{filter="<vm state>", zone="<zonename>", tags="<tagname>"}, are not duplicated in the reply.
  • Processing takes no more than about a minute.
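
The first expectation can be checked mechanically. The following is a minimal sketch, assuming the reply has been saved as a string in the Prometheus text exposition format; the class and method names are hypothetical, and the label values in the sample are made up. It reports every series (metric name plus label set) that occurs more than once.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for inspecting an exporter reply; not part of CloudStack.
final class DuplicateSeriesCheck {

    // Returns each series (name + labels) that appears more than once in the reply.
    public static Set<String> findDuplicateSeries(String metricsText) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String line : metricsText.split("\\R")) {
            String trimmed = line.trim();
            // Skip blank lines and "# HELP" / "# TYPE" comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String series = seriesKey(trimmed);
            if (!seen.add(series)) {
                duplicates.add(series);
            }
        }
        return duplicates;
    }

    // Simplification: the key ends at the last '}' if labels are present,
    // otherwise at the first space (metric without labels).
    public static String seriesKey(String sampleLine) {
        int brace = sampleLine.lastIndexOf('}');
        if (brace >= 0) {
            return sampleLine.substring(0, brace + 1);
        }
        int space = sampleLine.indexOf(' ');
        return space >= 0 ? sampleLine.substring(0, space) : sampleLine;
    }

    public static void main(String[] args) {
        // Sample reply with made-up label values and one repeated series.
        String reply =
              "cloudstack_vms_total_by_tag{filter=\"running\",zone=\"zone1\",tags=\"ssd\"} 3\n"
            + "cloudstack_vms_total_by_tag{filter=\"running\",zone=\"zone1\",tags=\"ssd\"} 3\n"
            + "cloudstack_vms_total_by_tag{filter=\"running\",zone=\"zone1\",tags=\"gpu\"} 1\n";
        System.out.println(findDuplicateSeries(reply));
    }
}
```

An empty result means every series in the reply is unique, which is the expected behavior after the patch.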

Actual behavior:

  • The cloudstack_vms_total_by_tag metrics with the same labels appear more than once in the reply.
  • Processing takes longer than necessary.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@phsm phsm changed the title from "prometheus: don't poll the same tag multiple times" to "fix: prometheus: don't poll the same tag multiple times" Feb 24, 2025

@DaanHoogland DaanHoogland left a comment

clgtm. This needs integration testing, so I have requested verification from some Prometheus users.

@codecov

codecov bot commented Feb 24, 2025

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 15.99%. Comparing base (c80b886) to head (ed3d1a0).
Report is 37 commits behind head on 4.20.

Files with missing lines | Patch % | Lines
...che/cloudstack/metrics/PrometheusExporterImpl.java | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               4.20   #10450   +/-   ##
=========================================
  Coverage     15.99%   15.99%           
+ Complexity    13081    13080    -1     
=========================================
  Files          5649     5649           
  Lines        495648   495648           
  Branches      60006    60006           
=========================================
  Hits          79265    79265           
  Misses       407537   407537           
  Partials       8846     8846           
Flag | Coverage Δ
uitests | 4.01% <ø> (ø)
unittests | 16.83% <0.00%> (ø)

@rohityadavcloud rohityadavcloud added this to the 4.20.1 milestone Feb 24, 2025
@Pearl1594

@blueorangutan package

@blueorangutan

@Pearl1594 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12550

@Pearl1594

@blueorangutan test

@blueorangutan

@Pearl1594 a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-12572)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 56038 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10450-t12572-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
test_06_purge_expunged_vm_background_task | Failure | 410.99 | test_purge_expunged_vms.py

@Pearl1594 Pearl1594 merged commit 8b09295 into apache:4.20 Mar 7, 2025
23 of 25 checks passed
@Pearl1594 Pearl1594 moved this to Done in ACS 4.20.1 Mar 17, 2025
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jun 19, 2025


5 participants